Abstract


This technical note presents a new approach to carrying out the kind of exploration achieved by 
Thompson sampling, but without explicitly maintaining or sampling from posterior distributions. 
The approach is based on a bootstrap technique that uses a combination of observed and artificially 
generated data. The latter serves to induce a prior distribution which, as we will demonstrate, is 
critical to effective exploration. We explain how the approach can be applied to multi-armed bandit 
and reinforcement learning problems and how it relates to Thompson sampling. The approach is 
particularly well-suited for contexts in which exploration is coupled with deep learning, since in these 
settings, maintaining or generating samples from a posterior distribution becomes computationally 
infeasible. 


1 Introduction 


To perform well in a sequential decision task while learning about its environment, an agent must 
balance between exploitation, making good decisions given available data, and exploration, taking 
actions that may help to improve the quality of future decisions. Perhaps the most principled 
approach is to compute a Bayes optimal solution, which optimizes the long-run expected rewards 
given prior beliefs. Although conceptually simple, this approach is computationally intractable for 
all but the simplest of problems. As such, engineers typically turn to tractable heuristic exploration 
strategies. 

Upper-confidence bound approaches offer one popular class of exploration heuristics that come with 
performance guarantees. Such approaches assign to poorly-understood actions high but statistically 
plausible values, effectively allocating an optimism bonus to incentivize exploration of each such 
action. For a broad class of problems, if optimism bonuses are well-designed, upper-confidence bound 
algorithms enjoy optimal learning rates. However, designing, tuning, and applying such algorithms 
can be challenging or intractable, and as such, upper-confidence bound algorithms applied in practice 
often suffer poor empirical performance. 

Another popular heuristic, which on the surface appears unrelated, is called Thompson sampling or 
probability matching. In this algorithm the agent maintains a posterior distribution of beliefs and, 
in each time period, samples an action randomly according to the probability that it is optimal.
Although this algorithm is one of the oldest recorded exploration heuristics, it received relatively little 
attention until recently when its strong empirical performance was noted, and a host of analytic 
guarantees followed. It turns out that there is a deep connection between Thompson sampling
and optimistic algorithms; in particular, as shown in [1], Thompson sampling can be viewed as a 
randomized approximation that approaches the performance of a well-designed and well-tuned upper 
confidence bound algorithm. 

Almost all of the literature on Thompson sampling takes the ability to sample from a posterior
distribution as given. For many commonly used distributions, this is served through conjugate updates
or Markov chain Monte Carlo methods. However, such methods do not adequately accommodate 
contexts in which models are nonlinearly parameterized in potentially complex ways, as is the case in 
deep learning. In this paper we introduce an alternative approach to tractably attaining the behavior 
of Thompson sampling in a manner that accommodates such nonlinearly parameterized models and, 
in fact, may offer advantages more broadly when it comes to efficiently implementing and applying 
Thompson sampling. 

The approach we propose is based on a bootstrap technique that uses a combination of observed and 
artificially generated data. The idea of using the bootstrap to approximate a posterior distribution 
is not new; it has been noted since the inception of the bootstrap concept. Further, the application of
the bootstrap [2] and other related sub-sampling approaches [3] to approximate Thompson sampling 
is not new. However, we show that these existing approaches fail to ensure sufficient exploration for 
effective performance in sequential decision problems. As we will demonstrate, the way in which we 
generate and use artificial data is critical. The approach is particularly well-suited for contexts in 
which exploration is coupled with deep learning, since in such settings, maintaining or generating 
samples from a posterior distribution becomes computationally infeasible. Further, our approach is 
parallelizable and as such scales well to massive complex problems. We explain how the approach 
can be applied to multi-armed bandit and reinforcement learning problems and how it relates to 
Thompson sampling. 


2 Priors, posteriors, and the bootstrap 


The term bootstrap refers to a class of methods for nonparametric estimation from data-driven
simulation [1]. In essence, the bootstrap uses the empirical distribution of a sampled dataset as an
estimate of the population distribution. Algorithm 1 provides pseudocode for what is perhaps the most
common form of bootstrap [1]. We use $\mathcal{P}(\mathcal{X})$ to denote the set of probability measures over a set $\mathcal{X}$.


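Algorithm 1

Bootstrap

Input: Data $x_1, \ldots, x_N \in \mathcal{X}$, function $\phi : \mathcal{P}(\mathcal{X}) \to \mathcal{Y}$, $K \in \mathbb{N}$
Output: Probability measure $\hat{P} \in \mathcal{P}(\mathcal{Y})$

1: for $k = 1, \ldots, K$ do
2:   sample data $x^k_1, \ldots, x^k_N$ from $\{x_1, \ldots, x_N\}$ uniformly with replacement
3:   for all $dx \subseteq \mathcal{X}$, let $\hat{P}_k(dx) = \sum_{n=1}^{N} \mathbb{1}(x^k_n \in dx) / N$
4:   compute $y_k = \phi(\hat{P}_k)$
5: end for
6: For all $dy \subseteq \mathcal{Y}$, let $\hat{P}(dy) = \sum_{k=1}^{K} \mathbb{1}(y_k \in dy) / K$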


This procedure estimates the distribution of any unknown parameter in a nonparametric
manner. As described, with a function $\phi$ of probability measures as input, the algorithm is somewhat
abstract. However, it easily specializes to familiar concrete versions. For example, suppose $\phi$ is the
expectation operator. Then each sample $y_k$ is the mean of a resampled data set, and $\hat{P}$ is the
relative frequency measure of these sample means.
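
To make the specialization concrete, the following is a minimal Python sketch of Algorithm 1 with $\phi$ taken to be the sample mean. The function name `bootstrap`, the toy data, and the choice to pass $\phi$ a resampled list rather than a probability measure are illustrative assumptions, not part of the original note.

```python
import random
from statistics import mean

def bootstrap(data, phi, K, rng):
    """Sketch of Algorithm 1: return K bootstrap replicates y_1..y_K of phi."""
    N = len(data)
    replicates = []
    for _ in range(K):
        # Resample N points uniformly with replacement from the data.
        resample = [data[rng.randrange(N)] for _ in range(N)]
        # phi acts on the resample, which stands in for the measure P_k.
        replicates.append(phi(resample))
    # The relative frequencies of these values play the role of the output measure.
    return replicates

rng = random.Random(0)
data = [0.2, 0.5, 0.9, 1.4, 2.0]          # hypothetical observations
means = bootstrap(data, mean, K=1000, rng=rng)
```

Every replicate is an average of observed points, so the resulting distribution never places mass outside the range of the data.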

The output $\hat{P}$ is reminiscent of a Bayesian posterior with an extremely weak prior. In fact, with a
small modification, Algorithm 1 becomes Algorithm 2, the so-called Bayesian bootstrap, for which
the distribution produced can be interpreted as a posterior based on the data and a degenerate
Dirichlet prior [5].


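Algorithm 2

BayesBootstrap

Input: Data $x_1, \ldots, x_N \in \mathcal{X}$, function $\phi : \mathcal{P}(\mathcal{X}) \to \mathcal{Y}$, $K \in \mathbb{N}$
Output: Probability measure $\hat{P} \in \mathcal{P}(\mathcal{Y})$

1: for $k = 1, \ldots, K$ do
2:   sample $w^k_1, \ldots, w^k_N \sim \mathrm{Exp}(1)$
3:   for all $dx \subseteq \mathcal{X}$, let $\hat{P}_k(dx) = \sum_{n=1}^{N} w^k_n \mathbb{1}(x_n \in dx) / \sum_{n=1}^{N} w^k_n$
4:   compute $y_k = \phi(\hat{P}_k)$
5: end for
6: For all $dy \subseteq \mathcal{Y}$, let $\hat{P}(dy) = \sum_{k=1}^{K} \mathbb{1}(y_k \in dy) / K$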
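
As an illustration of how Algorithm 2 differs from Algorithm 1, the sketch below replaces resampling with random Exp(1) weights that are normalised to sum to one. The names `bayes_bootstrap` and `weighted_mean` and the toy data are assumptions made for this example only.

```python
import random

def bayes_bootstrap(data, phi, K, rng):
    """Sketch of Algorithm 2: reweight the data with normalised Exp(1) weights."""
    replicates = []
    for _ in range(K):
        w = [rng.expovariate(1.0) for _ in data]
        total = sum(w)
        probs = [wi / total for wi in w]     # mass placed on each x_n
        replicates.append(phi(data, probs))
    return replicates

def weighted_mean(xs, probs):
    """phi as the expectation under the reweighted empirical measure."""
    return sum(x * p for x, p in zip(xs, probs))

rng = random.Random(0)
data = [0.2, 0.5, 0.9, 1.4, 2.0]             # hypothetical observations
draws = bayes_bootstrap(data, weighted_mean, K=2000, rng=rng)
```

The normalised weights follow a flat Dirichlet distribution over the data points, which is what gives the output its posterior interpretation.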


With the bootstrap approaches we have described, the support of the distributions $\hat{P}_k$ is restricted to
the dataset $\{x_1, \ldots, x_N\}$. We will show that in sequential decision problems this poses a significant
problem. To address the problem, we propose a simple extension to the bootstrap. In particular,
we augment the dataset $\{x_1, \ldots, x_N\}$ with artificially generated samples $\{x_{N+1}, \ldots, x_{N+M}\}$ and apply
the bootstrap to the combined dataset. The artificially generated data can be viewed as inducing a
prior distribution. In fact, if $\mathcal{X}$ is finite and $\{x_{N+1}, \ldots, x_{N+M}\} = \mathcal{X}$, then as $K$ grows, the
distribution $\hat{P}$ produced by the Bayesian bootstrap converges to the posterior distribution conditioned on
$\{x_1, \ldots, x_N\}$, given a uniform Dirichlet prior.

When the set $\mathcal{X}$ is large or infinite, the idea of generating one artificial sample per element does
not scale gracefully. If we require one prior observation for every single possible value, then the
observed data $\{x_1, \ldots, x_N\}$ will bear little influence on the distribution $\hat{P}$ relative to the artificial
data $\{x_{N+1}, \ldots, x_{N+M}\}$. To address this, for a selected value of $M$, we sample the $M$ artificial data
points $x_{N+1}, \ldots, x_{N+M}$ from a "prior" distribution $P_0$. The important thing here is that the relative
strength $M/N$ of the induced prior can be controlled in an explicit manner. Similarly, depending on
the choice of the prior sampling distribution $P_0$, the posterior is no longer restricted to finite support.
In many ways this extension corresponds to using a Dirichlet process prior with generator $P_0$.
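
The effect of the artificial samples on the support can be seen in a small sketch. Here the function `augmented_bayes_bootstrap_mean`, the uniform choice of $P_0$, and the degenerate observed data are hypothetical and chosen only to make the contrast visible.

```python
import random

def augmented_bayes_bootstrap_mean(observed, prior_sampler, M, rng):
    """One Bayesian-bootstrap draw of the mean over observed data plus M artificial points."""
    artificial = [prior_sampler(rng) for _ in range(M)]   # fresh draws from P_0
    combined = list(observed) + artificial
    w = [rng.expovariate(1.0) for _ in combined]
    return sum(wi * xi for wi, xi in zip(w, combined)) / sum(w)

rng = random.Random(1)
observed = [0.0] * 5                      # N = 5 identical observations
uniform_prior = lambda r: r.random()      # assumed P_0 = Uniform[0, 1]

no_prior = [augmented_bayes_bootstrap_mean(observed, uniform_prior, 0, rng)
            for _ in range(200)]
with_prior = [augmented_bayes_bootstrap_mean(observed, uniform_prior, 2, rng)
              for _ in range(200)]
```

With $M = 0$ every draw equals the observed value exactly, whereas with $M = 2$ the draws spread over a range whose width is governed by the ratio $M/N$.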

This augmented bootstrap procedure is especially promising for nonlinear functions such as deep 
neural networks, where sampling from a posterior is typically intractable. In fact, given any method 
of training a neural network on a dataset of size N, we can generate an approximate posterior through 
K bootstrapped versions of the neural network. In its most naive implementation this increases the 
computational cost by a factor of K. However, this approach is parallelizable and therefore scalable 
with compute power. In addition, it may be possible to significantly reduce the computational cost 
of this bootstrap procedure by sharing some of the lower-level features between bootstrap sampled 
networks or growing them out in a tree structure. This could even be implemented on a single 
chip through a specially constructed dropout mask for each bootstrap sample. With a deep neural 
network this could provide significant savings. 


3 Multi-armed bandit


Consider a problem in which an agent sequentially chooses actions $(A_t : t \in \mathbb{N})$ from an action set $\mathcal{A}$
and observes corresponding outcomes $(Y_{t,A_t} : t \in \mathbb{N})$. There is a random outcome $Y_{t,a} \in \mathcal{Y}$ associated
with each $a \in \mathcal{A}$ and time $t \in \mathbb{N}$. For each random outcome the agent receives a reward $R(Y_{t,a})$,
where $R : \mathcal{Y} \to \mathbb{R}$ is a known function. The "true outcome distribution" $p^*$ is itself drawn from a
family of distributions $\mathcal{P}$. We assume that, conditioned on $p^*$, $(Y_t : t \in \mathbb{N})$ is an iid sequence with
each element $Y_t$ distributed according to $p^*$. Let $p^*_a$ be the marginal distribution corresponding to
$Y_{t,a}$. The $T$-period regret of the sequence of actions $A_1, \ldots, A_T$ is the random variable
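
$$\mathrm{Regret}(T, p^*) = \sum_{t=1}^{T} \left( \max_{a \in \mathcal{A}} \mathbb{E}\left[ R(Y_{t,a}) \mid p^* \right] - \mathbb{E}\left[ R(Y_{t,A_t}) \mid p^* \right] \right).$$

The Bayesian regret to time $T$ is defined by $\mathrm{BayesRegret}(T) = \mathbb{E}[\mathrm{Regret}(T, p^*)]$, where the
expectation is taken with respect to the prior distribution over $p^*$.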


We take all random variables to be defined with respect to a probability space $(\Omega, \mathcal{F}, \mathbb{P})$. We will
denote by $H_t$ the history of observations $(A_1, Y_{1,A_1}, \ldots, A_{t-1}, Y_{t-1,A_{t-1}})$ realized prior to time $t$. Each
action $A_t$ is selected based only on $H_t$ and possibly some external source of randomness. To represent
this external source, we introduce a sequence of iid random variables $(U_t : t \in \mathbb{N})$. Each action $A_t$ is
measurable with respect to the sigma-algebra generated by $(H_t, U_t)$.


The objective is to choose actions in a manner that minimizes Bayesian regret. For this purpose,
it is useful to think of actions as being selected by a randomized policy $\pi = (\pi_t : t \in \mathbb{N})$, where
each $\pi_t$ is a distribution over actions and is measurable with respect to the sigma-algebra generated
by $H_t$. An action $A_t$ is chosen at time $t$ by randomizing according to $\pi_t(\cdot) = \mathbb{P}(A_t \in \cdot \mid H_t)$. Our
bootstrapped Thompson sampling algorithm, presented as Algorithm 3, serves as such a policy. The
algorithm uses a bootstrap algorithm, like Algorithm 1 or 2, as a subroutine. Note that the sequence
of action-observation pairs is not iid, though bootstrap algorithms effectively treat data passed to
them as iid. The function $\mathbb{E}[p^* \mid H_{t+M} = \cdot]$ passed to the bootstrap algorithm maps a specified history
of $t + M$ action-observation pairs to a probability distribution over reward vectors. The resulting
probability distribution can be thought of as a model fit to the data provided in the history. Note
that the algorithm takes as input a distribution $\tilde{P}$ from which artificial samples are drawn. This
can be thought of as a subroutine that generates $M$ action-observation pairs. As a special case,
this subroutine could generate $M$ deterministic pairs. There can be advantages, though, to using a
stochastic sampling routine, especially when the space of action-observation pairs is large and we do
not want to impose too strong a prior.


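Algorithm 3

BootstrapThompson

Input: Bootstrap algorithm $B$, artificial history length $M$ and sampling distribution $\tilde{P}$
1: $H_1 = ()$
2: for $t = 1, 2, \ldots$ do
3:   Sample artificial history $\tilde{H} = ((\tilde{A}_1, \tilde{Y}_1), \ldots, (\tilde{A}_M, \tilde{Y}_M)) \sim \tilde{P}$
4:   Bootstrap sample $\hat{P} = B(\tilde{H} \cup H_t, \mathbb{E}[p^* \mid H_{t+M} = \cdot], K = 1)$
5:   Sample $\hat{p} \sim \hat{P}$
6:   Select $A_t \in \arg\max_a \mathbb{E}[R(Y_{t,a}) \mid p^* = \hat{p}]$
7:   Observe $Y_{t,A_t}$
8:   Update $H_{t+1} = H_t \cup (A_t, Y_{t,A_t})$
9: end for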


This algorithm is similar to Thompson sampling, though the posterior sampling step has been
replaced by a single bootstrap sample. As we will establish in Section 3.2, for several multi-armed
bandit problems BootstrapThompson with appropriate artificial data is equivalent to Thompson
sampling.
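
The loop in Algorithm 3 can be sketched in Python for a bandit with independent arms. This is a simplified illustration under stated assumptions: the fitted model is just a Bayesian-bootstrap weighted mean reward per arm, arms with no data at all are tried first, and `bootstrap_thompson`, `pull`, and `prior_sampler` are hypothetical names introduced here.

```python
import random

def bootstrap_thompson(pull, n_arms, T, M, prior_sampler, rng):
    """Sketch of Algorithm 3 using a Bayesian bootstrap over per-arm mean rewards."""
    history = []                     # observed (action, outcome) pairs
    actions = []
    for _ in range(T):
        # Fresh artificial history of M (action, outcome) pairs each period.
        artificial = [prior_sampler(rng) for _ in range(M)]
        sums = [0.0] * n_arms
        weights = [0.0] * n_arms
        for a, y in artificial + history:
            w = rng.expovariate(1.0)             # Exp(1) weight per data point
            sums[a] += w * y
            weights[a] += w
        # Assumption: an arm with no data is treated as maximally attractive.
        est = [s / w if w > 0 else float("inf") for s, w in zip(sums, weights)]
        a_t = max(range(n_arms), key=lambda a: est[a])
        history.append((a_t, pull(a_t, rng)))
        actions.append(a_t)
    return actions

rng = random.Random(0)
true_means = [0.3, 0.7]                          # hypothetical Bernoulli arms
pull = lambda a, r: 1.0 if r.random() < true_means[a] else 0.0
prior = lambda r: (r.randrange(2), r.random())   # random arm, uniform outcome
actions = bootstrap_thompson(pull, n_arms=2, T=1000, M=2, prior_sampler=prior, rng=rng)
```

Each period reweights the whole history, so the cost per step grows with the amount of data collected.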

One drawback of Algorithm 3 is that the computational cost per timestep grows with the amount
of data $H_t$. For applications at large scale this will be prohibitive. Fortunately there is an effective
method to approximate these bootstrap samples in an online manner at constant computational cost
[2]. Instead of generating a new bootstrap sample every step we can approximate the
bootstrap distribution by training $D \in \mathbb{N}$ online bootstrap models in parallel and then sampling
between them uniformly. In its most naive implementation this parallel bootstrap will have a
computational cost per timestep $D$ times larger than a greedy algorithm. However, for specific
function classes such as neural networks it may be possible to share some computation between
models and provide significant savings.


3.1 Simulation results 


We now examine a simple problem instance designed to demonstrate the need for an artificial history
to incentivize efficient exploration in BootstrapThompson. The action space is $\mathcal{A} = \{1, 2\}$, with outcomes
$\mathcal{Y} = [0, 1]$ and rewards $R(y) = y$. We fix $0 < \epsilon \ll 1$ and describe the true underlying distribution in
terms of the Dirac delta function $\delta_x(y)$, which assigns all probability mass to $x$:


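$$p^*_a(y) = \begin{cases} \delta_\epsilon(y) & \text{if } a = 1 \\ (1 - 2\epsilon)\,\delta_0(y) + 2\epsilon\,\delta_1(y) & \text{if } a = 2 \end{cases}$$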


The optimal policy is to pick $a = 2$ at every timestep, since this has an expected reward of $2\epsilon$ instead
of just $\epsilon$. However, with probability at least $1 - 2\epsilon$, BootstrapThompson without artificial history
($M = 0$) will never learn the optimal policy.

To see why this is the case, note that BootstrapThompson without artificial history must begin by
sampling each arm once. In our system this means that with probability $1 - 2\epsilon$ the agent will receive
a reward of $\epsilon$ from arm one and 0 from arm two. Given this history, the algorithm will prefer to
choose arm one for all subsequent timesteps, since its bootstrap estimates will always put all their
mass on $\epsilon$ and 0 respectively. However, we show that this failing can easily be remedied by the
inclusion of some artificial history.
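
The argument can be checked with a few lines of Python. The sketch starts from the likely history in which arm two returned 0 on its single pull and uses the standard bootstrap with $M = 0$. The helper names and the 1000-step horizon are assumptions of this example.

```python
import random

def bootstrap_mean(samples, rng):
    """Mean of one bootstrap resample (uniform, with replacement)."""
    n = len(samples)
    return sum(samples[rng.randrange(n)] for _ in range(n)) / n

eps = 0.01
rng = random.Random(0)

def pull(arm, rng):
    """True outcome distribution of the two-armed example."""
    if arm == 1:
        return eps
    return 1.0 if rng.random() < 2 * eps else 0.0

# History after one pull of each arm, in the case where arm two paid 0.
history = {1: [eps], 2: [0.0]}

choices = []
for _ in range(1000):
    est = {a: bootstrap_mean(ys, rng) for a, ys in history.items()}
    a_t = max(est, key=est.get)        # greedy with respect to the bootstrap sample
    choices.append(a_t)
    history[a_t].append(pull(a_t, rng))
```

Every resample of arm two's history is the single value 0, so its estimate can never exceed that of arm one and arm two is never revisited.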

In Figure 1 we plot the cumulative regret of three variants of BootstrapThompson using different
bootstrap algorithms with $M = 0$ or $M = 2$. For our simulations we set $\epsilon = 0.01$ and ran 20 Monte
Carlo simulations for each variant of the algorithm. In each simulation and each time period, the pair
of artificial data points for cases with $M = 2$ was sampled from a distribution $\tilde{P}$ that selects each of
the two actions once and samples an observation uniformly from $[0, 1]$ for each. We found that, in
this example, the choice of bootstrap method makes little difference but that injecting artificial data
is crucial to incentivizing efficient exploration.

The six subplots in Figure 1 present results differentiated by choice of bootstrap algorithm and
whether or not artificial data is included. The columns indicate which bootstrap algorithm is used, with
labels "Bootstrap" for Algorithm 1, "Bayes" for Algorithm 2, and "BESA" for a recently proposed
bootstrap approach. The rows indicate whether or not artificial prior data was used. We see that
BootstrapThompson generally fails to learn with $M = 0$; however, the inclusion of artificial data helps
to drive efficient exploration. The choice of bootstrap algorithm $B$ seems to make relatively little
difference, though we do find that Algorithms 1 and 2 seem to outperform BESA on this example.¹

[Figure 1 panels: B = Bayes, B = BESA, B = Bootstrap; x-axis: time step]

Figure 1: Cumulative regret of BootstrapThompson using different bootstrap methods (lower is
better). Artificial prior data helps to drive efficient exploration.


3.2 Analysis 

The BootstrapThompson algorithm is similar to Thompson sampling, the only difference being that 
a draw from the posterior distribution is replaced by a bootstrap sample. In fact, we can show that, 
for particular choices of bootstrap algorithm and artificial data distribution, the two algorithms are 
equivalent. 

Consider as an example a multi-armed bandit problem with independent arms, for which the
$a$th arm generates rewards from a Bernoulli distribution with mean $\theta_a$. Suppose that our prior
distribution over $\theta_a$ is $\mathrm{Beta}(\alpha_a, \beta_a)$. Then, the posterior conditioned on observing $n_{a0}$ outcomes
with reward zero and $n_{a1}$ outcomes with reward one is $\mathrm{Beta}(\alpha_a + n_{a1}, \beta_a + n_{a0})$. If $\alpha_a$ and $\beta_a$ are
positive integers, a sample $\hat{\theta}_a$ from this distribution can be generated by the following procedure:
sample $x_1, \ldots, x_{\alpha_a + n_{a1}}, y_1, \ldots, y_{\beta_a + n_{a0}} \sim \mathrm{Exp}(1)$ and let $\hat{\theta}_a = \sum_i x_i / (\sum_i x_i + \sum_j y_j)$. This sampling
procedure is identical to BayesBootstrap (Algorithm 2) with artificial data generated by a distribution
$\tilde{P}$ that assigns all probability to a single outcome that, for each arm $a$, produces $\alpha_a + \beta_a$ data samples,
with $\alpha_a$ of them associated with reward one and $\beta_a$ of them associated with reward 0.
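
The sampling procedure above can be sketched directly. The function name and the specific counts are illustrative assumptions. The pseudo-count `alpha` is attached to reward-one observations, as in the procedure just described.

```python
import random

def beta_via_exponentials(alpha, beta, n1, n0, rng):
    """Draw theta as a ratio of sums of Exp(1) variables (integer alpha, beta)."""
    x = sum(rng.expovariate(1.0) for _ in range(alpha + n1))   # reward-one points
    y = sum(rng.expovariate(1.0) for _ in range(beta + n0))    # reward-zero points
    return x / (x + y)

rng = random.Random(0)
# Hypothetical arm: uniform prior (alpha = beta = 1), 7 successes, 3 failures.
draws = [beta_via_exponentials(1, 1, n1=7, n0=3, rng=rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
```

The empirical mean of the draws should sit close to $(\alpha + n_1)/(\alpha + \beta + n_0 + n_1) = 8/12$, the mean of the corresponding Beta distribution.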

The example we have presented can easily be generalized to the case where each arm generates 
rewards from among a finite set of possibilities with probabilities distributed according to a Dirichlet 
prior. We expect that, with appropriately designed schemes for generating artificial data, such 
equivalences can also be established for a far broader range of problems. 

¹ The BESA algorithm is a variant of the bootstrap that applies to two-armed bandit problems. In each time period,
the algorithm estimates the reward of each arm by drawing a sample average (with replacement) with sample size equal 
to the number of times the other arm has been played. This apparently performs well in some settings [3], but the 
approach does not generalize gracefully to settings with dependent arms. 

The aforementioned equivalencies imply that theoretical regret bounds previously developed for
Thompson sampling [?] apply to the BootstrapThompson algorithm with the Bayesian bootstrap
and appropriately generated artificial data. 


4 Reinforcement learning 


In reinforcement learning, actions taken by the agent can impose delayed consequences. This makes
the design of exploration strategies more challenging than for multi-armed bandit problems. To fix
a context, consider an agent that interacts with an environment over repeated episodes of length $\tau$.
In each time period $t = 1, \ldots, \tau$ of each episode $l = 1, 2, \ldots$, the agent observes a state $s_{lt}$ and
selects an action $a_{lt}$ according to a policy $\pi$ which maps states to actions. A reward of $r_{lt}$ and state
transition to $s_{l,t+1}$ are then realized. The agent's goal is to maximize the long term sum of expected
rewards, even though she is initially unsure of the system dynamics and reward structure.

A common approach to reinforcement learning involves learning a state-action value function $Q$,
which for each time $t$, state $s$, and action $a$, provides an estimate $Q_t(s, a)$ of expected rewards over
the remainder of the episode: $r_{lt} + r_{l,t+1} + \cdots + r_{l\tau}$. Given a state-action value function $Q$, it is
natural for the agent to select an action that maximizes $Q_t(s, a)$ when at state $s$ at time $t$.

There is a large literature on reinforcement learning algorithms which balance exploration with
exploitation in a variety of ways [?]. However, the vast majority of these algorithms operate in
the "tabula rasa" setting, which does not allow for generalization between state-action pairs. For most
practical systems, where the number of states and actions is very large or even infinite, the ability to
generalize is crucial for good performance. Of those algorithms which do combine generalization with
exploration, many require an intractable model-based planning step, or are restricted to unrealistic
parametric domains [?].

By contrast, some of the most successful applications of reinforcement learning generalize using
nonlinearly parameterized models, like deep neural networks, that approximate the state-action value
function [?]. These algorithms have attained superhuman performance and generated excitement
for a new wave of artificial intelligence, but still fail at simple tasks that require efficient exploration
since they use simple exploration schemes that do not adequately account for the possibility of delayed
consequences. Recent research has shown how to combine efficient generalization and exploration via
randomized linearly parameterized value functions [?]. That approach can be viewed
as a variant of Thompson sampling applied in a reinforcement learning context, but it
does not serve the needs of nonlinear parameterizations. What we present now, as Algorithm 4,
is a version of Thompson sampling that does serve such needs by leveraging the bootstrap and
artificial data.
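

Algorithm 4

Reinforcement Learning with Bootstrapped Value Function Randomization

1: Input: Bootstrap algorithm $B$, value function approximator $\phi$,
2:   number of artificial episodes $M$, sampling distribution $\tilde{P}$
3: $H = ()$
4: for episode $l = 1, 2, \ldots$ do
5:   Sample $\tilde{H} = ((\tilde{s}_{11}, \tilde{a}_{11}, \tilde{r}_{11}, \ldots, \tilde{s}_{1\tau}, \tilde{a}_{1\tau}, \tilde{r}_{1\tau}), \ldots, (\tilde{s}_{M1}, \tilde{a}_{M1}, \tilde{r}_{M1}, \ldots, \tilde{s}_{M\tau}, \tilde{a}_{M\tau}, \tilde{r}_{M\tau})) \sim \tilde{P}$
6:   Bootstrap sample $\hat{P} \leftarrow B(H \cup \tilde{H}, \phi, K = 1)$
7:   Sample $Q \sim \hat{P}$
8:   for time $t = 1, \ldots, \tau$ do
9:     Select action $a_{lt} \in \arg\max_a Q_t(s_{lt}, a)$
10:     Observe reward $r_{lt}$, transition to $s_{l,t+1}$
11:   end for
12:   Update $H \leftarrow H \cup (s_{l1}, a_{l1}, r_{l1}, \ldots, s_{l\tau}, a_{l\tau}, r_{l\tau})$
13: end for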


In the context of our episodic setting, each element of the data set corresponds to a sequence of
observations made over an episode. The algorithm takes as input a function $\phi$, which should itself
be viewed as an algorithm that estimates the state-action value function from this data set. For
example, $\phi$ could output a deep neural network trained to fit a state-action value function via
least-squares value iteration. A number of conventional reinforcement learning algorithms would fit the
state-action value function to the observed history $H$. Two key features that distinguish Algorithm 4
are that the state-action value function is fit to a random subsample of data and that this subsample
is drawn from a combination of historical and artificial data.

Before the beginning of each episode, the algorithm applies $\phi$ to generate a randomized
state-action value function. The agent then follows the greedy policy with respect to that sample over
the entire episode. As is more broadly the case with Thompson sampling, the algorithm balances 
exploration with exploitation through the randomness of these samples. The algorithm enjoys the 
benefits of what we call deep exploration in that it sometimes selects actions which are neither 
exploitative nor informative in themselves, but that are oriented toward positioning the agent to 
gain useful information downstream in the episode. In fact, the general approach represented by 
this algorithm may be the only known computationally efficient means of achieving deep exploration 
with nonlinearly parameterized representations such as deep neural networks. 
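
To illustrate the episode-level structure of Algorithm 4, the following Python sketch uses a tabular stand-in for $\phi$ in place of a neural network. Everything named here (`fit_q`, `bootstrapped_rl`, the three-state chain, and the optimistic artificial episodes) is a hypothetical construction for illustration and not the implementation used in this note.

```python
import random
from collections import defaultdict

def fit_q(episodes, weights, tau, n_actions):
    """Hypothetical phi: weighted tabular value estimates, solved backwards in time."""
    Q = [dict() for _ in range(tau)]
    for t in reversed(range(tau)):
        num, den = defaultdict(float), defaultdict(float)
        for episode, w in zip(episodes, weights):
            s, a, r, s_next = episode[t]
            v_next = 0.0
            if t + 1 < tau:
                v_next = max(Q[t + 1].get((s_next, b), 0.0) for b in range(n_actions))
            num[(s, a)] += w * (r + v_next)
            den[(s, a)] += w
        Q[t] = {sa: num[sa] / den[sa] for sa in num if den[sa] > 0}
    return Q

def bootstrapped_rl(env_step, s0, tau, n_actions, n_episodes, M, artificial_episode, rng):
    H, returns = [], []
    for _ in range(n_episodes):
        data = H + [artificial_episode(rng) for _ in range(M)]
        weights = [rng.expovariate(1.0) for _ in data]   # one Bayesian-bootstrap draw
        Q = fit_q(data, weights, tau, n_actions)
        s, episode = s0, []
        for t in range(tau):
            # Greedy with respect to the sampled Q, ties broken at random.
            a = max(range(n_actions),
                    key=lambda b: (Q[t].get((s, b), 0.0), rng.random()))
            r, s_next = env_step(s, a, rng)
            episode.append((s, a, r, s_next))
            s = s_next
        H.append(episode)
        returns.append(sum(step[2] for step in episode))
    return returns

def chain_step(s, a, rng):
    """Toy chain with states 0..2: action 1 moves right, action 0 moves left."""
    s_next = min(s + 1, 2) if a == 1 else max(s - 1, 0)
    return (1.0 if s_next == 2 else 0.0), s_next    # reward when next state is 2

def optimistic_episode(rng, tau=4):
    """Artificial episode: random state-action pairs, reward 1, random transitions."""
    return [(rng.randrange(3), rng.randrange(2), 1.0, rng.randrange(3))
            for _ in range(tau)]

returns = bootstrapped_rl(chain_step, 0, 4, 2, 100, 2, optimistic_episode,
                          random.Random(0))
```

The sampled value function is held fixed for a whole episode, so the agent commits to a multi-step plan rather than dithering at each step.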

As discussed earlier, the inclusion of an artificial history can be crucial to incentivize proper
exploration in multi-armed bandit problems. The same is true for reinforcement learning. One simple
approach to generating artificial data that accomplishes this in the context of Algorithm 4 is to
sample state-action pairs from a diffusely mixed generative model and assign them stochastically
optimistic rewards (see [?] for a definition) and random state transitions. When prior data is
available from episodes of experience with actions selected by an expert agent, one can also augment this
artificial data with that history of experience. This offers a means of incorporating apprenticeship
learning as a springboard for the learning process.

Fitting a model like a deep neural network can itself be a computationally expensive task. As
such it is desirable to use incremental methods that incorporate new data samples into the fitting
process as they appear, without having to refit the model from scratch. It is important to observe
that a slight variation of Algorithm 4 accommodates this sort of incremental fitting by leveraging
parallel computation. This variation is presented as Algorithms 5 and 6. The algorithm makes use
of an incremental model learning method $\phi$, which takes as input a current model, previous data set,
and new data point, with a weight assigned to each data point. Algorithm 6 maintains $K$ models
(for example, $K$ deep neural networks), incrementally updating each in parallel after observing each
episode. The model used to guide action in an episode is sampled uniformly from the set of $K$.
It is worth noting that this is akin to training each model using experience replay, but with past
experiences weighted randomly to induce exploration.


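Algorithm 5

IncrementalBayesBootstrapSample

Input: Data $x_1, \ldots, x_N \in \mathcal{X}$, weights $w_1, \ldots, w_{N-1} \in \mathbb{R}_+$, function $\phi : \mathcal{P}(\mathcal{X}) \to \mathcal{Y}$
Output: Sampled weight $w_N$, sampled outcome $y$
1: sample $w_N \sim \mathrm{Exp}(1)$
2: for all $dx \subseteq \mathcal{X}$, let $\hat{P}(dx) = \sum_{n=1}^{N} w_n \mathbb{1}(x_n \in dx) / \sum_{n=1}^{N} w_n$
3: compute $y = \phi(\hat{P})$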
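
The incremental idea can be sketched for the simple case where $\phi$ is the mean. Each of $K$ running estimates assigns a fresh Exp(1) weight to every new data point, and earlier weights are never revisited. The class `IncrementalBootstrapMean` and its methods are hypothetical names introduced for this example.

```python
import random

class IncrementalBootstrapMean:
    """K online Bayesian-bootstrap estimates of a mean, O(K) work per data point."""

    def __init__(self, K, rng):
        self.rng = rng
        self.weighted_sum = [0.0] * K
        self.weight_total = [0.0] * K

    def update(self, x):
        for k in range(len(self.weighted_sum)):
            w = self.rng.expovariate(1.0)      # new weight w_N for model k only
            self.weighted_sum[k] += w * x
            self.weight_total[k] += w

    def sample(self):
        # Pick one of the K models uniformly and return its current estimate.
        k = self.rng.randrange(len(self.weighted_sum))
        return self.weighted_sum[k] / self.weight_total[k]

est = IncrementalBootstrapMean(K=10, rng=random.Random(0))
for x in [1.0, 2.0, 3.0, 4.0]:
    est.update(x)
draw = est.sample()
```

Because the weights on past points are fixed once drawn, the cost of absorbing a new observation does not depend on how much data has been seen.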


Algorithm 6 

Incremental Reinforcement Learning with Bootstrapped Value Function Randomization 
Input: Incremental bootstrap algorithm $B$, value function approximator $\phi$,
  number of models $K$, initial models $Q^k$,
  number of artificial episodes $M$, sampling distribution $\tilde{P}$
Parfor bootstrap sample $k = 1, \ldots, K$
  $H^k = ()$
  Sample $\tilde{H}^k = ((\tilde{s}_{11}, \tilde{a}_{11}, \tilde{r}_{11}, \ldots, \tilde{s}_{1\tau}, \tilde{a}_{1\tau}, \tilde{r}_{1\tau}), \ldots, (\tilde{s}_{M1}, \tilde{a}_{M1}, \tilde{r}_{M1}, \ldots, \tilde{s}_{M\tau}, \tilde{a}_{M\tau}, \tilde{r}_{M\tau})) \sim \tilde{P}$
EndParfor
for episode $l = 1, 2, \ldots$ do
  Parfor bootstrap sample $k = 1, \ldots, K$
    Bootstrap sample $Q^k \leftarrow B(H^k \cup \tilde{H}^k, \phi, Q^k)$
  EndParfor
  Sample $k \sim \mathrm{unif}(1, \ldots, K)$
  for time $t = 1, \ldots, \tau$ do
    Select action $a_{lt} \in \arg\max_a Q^k_t(s_{lt}, a)$
    Observe reward $r_{lt}$, transition to $s_{l,t+1}$
  end for
  Parfor bootstrap sample $k = 1, \ldots, K$
    Update $H^k \leftarrow H^k \cup (s_{l1}, a_{l1}, r_{l1}, \ldots, s_{l\tau}, a_{l\tau}, r_{l\tau})$
  EndParfor
end for